In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.sparse as sps
from scipy.stats import skew

from course_lib.Data_manager.DataReader_utils import merge_ICM
from src.data_management.RecSys2019Reader import RecSys2019Reader
from src.data_management.RecSys2019Reader_utils import merge_UCM
from src.data_management.data_getter import get_popular_items
from src.data_management.data_reader import read_target_users, read_URM_cold_all, read_UCM_cold_all
from src.model import best_models, new_best_models
from src.plots.plot_evaluation_helper import plot_popularity_discretized

from src.data_management.DataPreprocessing import DataPreprocessingDiscretization, \
    DataPreprocessingImputation, DataPreprocessingFeatureEngineering, DataPreprocessingTransform
In [3]:
data_reader = RecSys2019Reader("../../data/")
data_reader.load_data()
URM_all = data_reader.get_URM_all()
ICM_categorical = data_reader.get_ICM_from_name("ICM_sub_class")

UCM_age = data_reader.get_UCM_from_name("UCM_age")
UCM_region = data_reader.get_UCM_from_name("UCM_region")
UCM_all, _ = merge_UCM(UCM_age, UCM_region, {}, {})
RecSys2019Reader: WARNING --> There is no verification in the consistency of UCMs
DataReader: Verifying data consistency...
DataReader: Verifying data consistency... Passed!
DataReader: current dataset is: <class 'src.data_management.RecSys2019Reader.RecSys2019Reader'>
	Number of items: 18495
	Number of users: 27255
	Number of interactions in URM_all: 398636
	Interaction density: 7.91E-04
	Interactions per user:
		 Min: 1.00E+00
		 Avg: 1.46E+01
		 Max: 6.77E+02
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 2.16E+01
		 Max: 8.99E+02
	Gini Index: 0.54

Do we have to predict them?

This question was not explored in the previous notebook on target-user analysis.

In [11]:
df_target = pd.read_csv("../data/data_target_users_test.csv")
target_users = df_target.user_id.values
URM_cold_all = read_URM_cold_all("../data/data_train.csv")
target_URM = URM_cold_all[target_users]
In [12]:
user_act = (target_URM > 0).sum(axis=1)
user_act = np.array(user_act).squeeze()
user_act = np.sort(user_act)
In [13]:
threshold_list = [0, 10, 20, 30, 40, 1000]
plot_popularity_discretized(user_act, threshold_list, y_label="Percentage of user popularity")
We have 3656 array elems with a 0 interactions
We have 13335 array elems with interactions in (0, 10]
We have 5412 array elems with interactions in (10, 20]
We have 2552 array elems with interactions in (20, 30]
We have 1330 array elems with interactions in (30, 40]
We have 1821 array elems with interactions in (40, 1000]
Out[13]:
(array([ 3656, 13335,  5412,  2552,  1330,  1821]),
 array([0.13007899, 0.47445385, 0.19255675, 0.09079912, 0.04732086,
        0.06479044]))
In [14]:
plt.plot(user_act, 'ro')
plt.xlabel('User index')
plt.ylabel('Number of interactions')
plt.show()
In [15]:
np.array(URM_all.sum(axis=1)).squeeze()[np.array(URM_all.sum(axis=1)).squeeze() > 40].size - user_act[user_act > 40].size
Out[15]:
125

There are 125 warm users (with more than 40 interactions) that are not in the test set. Let's repeat the check for the very warm ones.

In [16]:
np.array(URM_all.sum(axis=1)).squeeze()[np.array(URM_all.sum(axis=1)).squeeze() > 140].size - user_act[user_act > 140].size
Out[16]:
5

5 out of 55 of the very warm users should not be predicted. Who are they? What is their activity? Are they extremely warm (i.e. > 140 interactions) or only warm (between 40 and 140)?
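The "who are they" question can be answered directly with a set difference. A minimal sketch on synthetic data (the real notebook would use `URM_all` and the target users loaded earlier; all names and sizes here are toy assumptions):

```python
import numpy as np
import scipy.sparse as sps

# Hypothetical tiny URM: 6 users x 4 items
URM = sps.csr_matrix(np.array([
    [1, 1, 1, 1],   # user 0: 4 interactions
    [1, 0, 0, 0],   # user 1: 1 interaction
    [1, 1, 1, 0],   # user 2: 3 interactions
    [0, 0, 0, 0],   # user 3: cold
    [1, 1, 1, 1],   # user 4: 4 interactions
    [0, 1, 0, 0],   # user 5: 1 interaction
]))
target_users = np.array([0, 1, 2, 3])   # users we are asked to predict

interactions_per_user = np.ediff1d(URM.indptr)     # same indptr trick as above
warm_users = np.where(interactions_per_user >= 3)[0]   # users 0, 2, 4

# Warm users that are NOT in the target set: no prediction is needed for them
missing = np.setdiff1d(warm_users, target_users)
print(missing)   # -> [4]
```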

What is the structure of users with a very large number of interactions? Why are predictions for them so bad?

Taking inspiration from the MAP plots of many recommenders, we want to study how users with more than 40 interactions are structured.

In [17]:
very_warm_users_mask = np.ediff1d(URM_all.tocsr().indptr) > 40
very_warm_users = np.arange(URM_all.shape[0])[very_warm_users_mask]
very_warm_users.size
Out[17]:
1946

We should note, first of all, that these users never disappear while splitting, since we take out very few ratings; in this sense there is no variability across the different splits.
Let's first recap the general distribution, and how their count compares with that of other, less active, users.

In [18]:
user_act = (URM_all > 0).sum(axis=1)
user_act = np.array(user_act).squeeze()
user_act = np.sort(user_act)
threshold_list = [0, 10, 20, 30, 40, 1000]
plot_popularity_discretized(user_act, threshold_list, y_label="Percentage of user popularity")
We have 0 array elems with a 0 interactions
We have 15314 array elems with interactions in (0, 10]
We have 5814 array elems with interactions in (10, 20]
We have 2745 array elems with interactions in (20, 30]
We have 1436 array elems with interactions in (30, 40]
We have 1946 array elems with interactions in (40, 1000]
Out[18]:
(array([    0, 15314,  5814,  2745,  1436,  1946]),
 array([0.        , 0.56187855, 0.21331866, 0.10071547, 0.05268758,
        0.07139974]))
In [19]:
# Among these popular users, what is the distribution?
very_warm_URM = URM_all[very_warm_users_mask]
very_warm_user_act = (very_warm_URM > 0).sum(axis=1)
very_warm_user_act = np.array(very_warm_user_act).squeeze()
very_warm_user_act = np.sort(very_warm_user_act)
plt.plot(very_warm_user_act)
plt.xlabel('User index')
plt.ylabel('Number of interactions')
plt.show()

As we can see, the distribution is highly skewed. Let's try to separate these users and subdivide them by activity.

In [20]:
user_act = (very_warm_URM > 0).sum(axis=1)
user_act = np.array(user_act).squeeze()
user_act = np.sort(user_act)
threshold_list = [40, 70, 100, 140, 190, 1000]
plot_popularity_discretized(user_act, threshold_list, y_label="Percentage of user popularity")
We have 0 array elems with a 40 interactions
We have 1506 array elems with interactions in (40, 70]
We have 293 array elems with interactions in (70, 100]
We have 92 array elems with interactions in (100, 140]
We have 36 array elems with interactions in (140, 190]
We have 19 array elems with interactions in (190, 1000]
Out[20]:
(array([   0, 1506,  293,   92,   36,   19]),
 array([0.        , 0.77389517, 0.15056526, 0.04727646, 0.01849949,
        0.00976362]))
In [21]:
extreme_warm_users_mask = np.ediff1d(URM_all.tocsr().indptr) > 140
extreme_warm_users = np.arange(URM_all.shape[0])[extreme_warm_users_mask]
extreme_warm_URM = URM_all[extreme_warm_users_mask]
In [22]:
mask = np.in1d(very_warm_users, extreme_warm_users, invert=True)
not_too_warm_users = very_warm_users[mask]

temp = very_warm_URM[mask]
temp = (temp > 0).sum(axis=1)
temp = np.array(temp).squeeze()
temp = np.sort(temp)
plt.plot(temp)
plt.xlabel('User index')
plt.ylabel('Number of interactions')
plt.show()
In [23]:
mask = np.in1d(very_warm_users, extreme_warm_users, invert=True)
mask = np.logical_not(mask)
temp = very_warm_URM[mask]
temp = (temp > 0).sum(axis=1)
temp = np.array(temp).squeeze()
temp = np.sort(temp)
plt.plot(temp)
plt.xlabel('User index')
plt.ylabel('Number of interactions')
plt.show()

Very warm users: let's focus the analysis on them. What items do they like? What is their region/age information?

Let's start from the extreme outliers: how does item similarity behave for them? Let's see how often they like popular items, and how often they prefer unpopular ones.

In [24]:
# For each extreme warm user, count how many of their liked items are popular vs. unpopular
def get_pop_proportion(threshold):
    pop_items = get_popular_items(URM_all, threshold)
    quantity_of_popular_items_liked = np.zeros(extreme_warm_users.size)
    quantity_of_unpop_items_liked = np.zeros(extreme_warm_users.size)
    for i, user in enumerate(extreme_warm_users):
        items_liked_by_user = URM_all[user].indices
        q_pop = np.in1d(pop_items, items_liked_by_user).sum()
        quantity_of_popular_items_liked[i] = q_pop
        quantity_of_unpop_items_liked[i] = items_liked_by_user.size - q_pop
    return quantity_of_unpop_items_liked, quantity_of_popular_items_liked
In [25]:
quantity_of_unpop_items_liked, quantity_of_popular_items_liked = get_pop_proportion(500)
quantity_of_popular_items_liked
Out[25]:
array([31.,  7., 14., 11.,  2.,  7.,  4., 18., 23., 13., 11., 14.,  9.,
        9., 10.,  9., 17.,  9.,  9., 16.,  4.,  5.,  9., 13.,  5., 15.,
        9.,  3.,  4.,  4., 15., 14.,  4., 13., 12., 13., 12., 11.,  6.,
       10., 23.,  6.,  8., 13., 12., 10., 11., 10., 12., 13., 30., 21.,
       32., 17., 19.])
In [26]:
quantity_of_unpop_items_liked
Out[26]:
array([646., 195., 170., 153., 154., 146., 140., 252., 263., 163., 145.,
       206., 159., 138., 140., 160., 258., 136., 192., 239., 154., 140.,
       152., 144., 189., 132., 146., 155., 142., 191., 143., 128., 144.,
       172., 142., 129., 184., 172., 143., 152., 356., 166., 133., 171.,
       162., 138., 189., 134., 269., 152., 474., 434., 422., 227., 237.])
In [27]:
quantity_of_unpop_items_liked, quantity_of_popular_items_liked = get_pop_proportion(1)
quantity_of_unpop_items_liked
Out[27]:
array([2., 1., 3., 1., 1., 0., 1., 2., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       2., 1., 1., 4., 2., 0., 0., 1., 0., 0., 2., 0., 0., 0., 0., 5., 1.,
       0., 2., 1., 1., 0., 0., 1., 1., 0., 2., 0., 0., 4., 0., 0., 0., 2.,
       3., 4., 0., 1.])

Some of them are the only users who liked a given item.
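A small sketch (toy data, all names hypothetical) of how one could list the items that exactly one user interacted with, and who that user is; the real analysis would run this on `URM_all`:

```python
import numpy as np
import scipy.sparse as sps

# Hypothetical 4x5 URM
URM = sps.csr_matrix(np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]))
item_pop = np.array(URM.sum(axis=0)).squeeze()   # interactions per item
exclusive_items = np.where(item_pop == 1)[0]     # items liked by exactly one user

# Map each exclusive item to the single user who liked it (CSC gives fast column access)
URM_csc = URM.tocsc()
owners = {int(i): int(URM_csc[:, i].indices[0]) for i in exclusive_items}
print(owners)   # -> {2: 1, 3: 3, 4: 0}
```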

In [28]:
quantity_of_unpop_items_liked, quantity_of_popular_items_liked = get_pop_proportion(30)
quantity_of_unpop_items_liked
Out[28]:
array([161.,  72.,  48.,  42.,  78.,  35.,  42.,  53.,  70.,  43.,  46.,
        61.,  34.,  40.,  36.,  55.,  42.,  37.,  57.,  63.,  55.,  69.,
        38.,  41.,  64.,  13.,  39.,  47.,  48.,  83.,  26.,  20.,  65.,
        48.,  42.,  47.,  50.,  29.,  31.,  43.,  83.,  66.,  24.,  59.,
        46.,  38.,  57.,  28.,  59.,  22., 140., 114., 107.,  59.,  48.])

Maybe we should focus here, for instance, to obtain more fine-grained information about their taste.

Similarity analysis

In [29]:
# First, let's see how these users impact the user-user similarity matrix
user_cf = best_models.UserCF.get_model(URM_train=URM_all, load_model=False)
W_sparse = user_cf.W_sparse
UserKNNCFRecommender: URM Detected 3218 (17.40 %) cold items.
Similarity column 27255 ( 100 % ), 1858.15 column/sec, elapsed time 0.24 min
In [30]:
W_sparse[:, 55].data.size # This very active user is in the neighbourhood of many users
Out[30]:
995
In [31]:
W_sparse[55].data.size # And it has 7525 valid similarity values: its taste relates to many users, which may be odd
Out[31]:
7525
In [32]:
# Let's look at its top neighbourhood
values = np.sort(W_sparse[55].data)
plt.plot(values, 'ro')
plt.xlabel('User index')
plt.ylabel('Similarity value of user 55 (the most extreme)')
plt.show()
In [33]:
top_50_values = values[-50:] 
In [34]:
top_50_similar_users = W_sparse[55].indices[np.isin(W_sparse[55].data, top_50_values)]  # top 50 most similar users 
count_extreme=0
count_very=0
for user in top_50_similar_users:
    if user in extreme_warm_users:
        count_extreme+=1
    if user in very_warm_users:
        count_very+=1
In [35]:
count_extreme
Out[35]:
23
In [36]:
count_very
Out[36]:
47
In [37]:
# Compare with the neighbourhood of user 25, who has few ratings
plt.plot(np.sort(W_sparse[25].data), 'ro')
plt.xlabel('User index')
plt.ylabel('Similarity value of user 25 (with few ratings)')
plt.show()
In [7]:
data_reader = RecSys2019Reader("../data/")
data_reader = DataPreprocessingFeatureEngineering(data_reader,
                                                      ICM_names_to_count=["ICM_sub_class"])
data_reader = DataPreprocessingImputation(data_reader,
                                              ICM_name_to_agg_mapper={"ICM_asset": np.median,
                                                                  "ICM_price": np.median})
data_reader = DataPreprocessingTransform(data_reader,
                                         ICM_name_to_transform_mapper={"ICM_asset": lambda x: np.log1p(1 / x),
                                                                       "ICM_price": lambda x: np.log1p(1 / x),
                                                                       "ICM_item_pop": np.log1p,
                                                                       "ICM_sub_class_count": np.log1p})
data_reader = DataPreprocessingDiscretization(data_reader,
                                              ICM_name_to_bins_mapper={"ICM_asset": 200,
                                                                       "ICM_price": 200,
                                                                       "ICM_item_pop": 50,
                                                                       "ICM_sub_class_count": 50})
data_reader.load_data()
ICM_all = data_reader.get_ICM_from_name("ICM_all") 
item_cbf_cf_all = new_best_models.ItemCBF_CF.get_model(URM_train=URM_all, ICM_train=ICM_all)
RecSys2019Reader: WARNING --> There is no verification in the consistency of UCMs
DataReader: Verifying data consistency...
DataReader: Verifying data consistency... Passed!
DataReader: current dataset is: <class 'src.data_management.RecSys2019Reader.RecSys2019Reader'>
	Number of items: 18495
	Number of users: 27255
	Number of interactions in URM_all: 398636
	Interaction density: 7.91E-04
	Interactions per user:
		 Min: 1.00E+00
		 Avg: 1.46E+01
		 Max: 6.77E+02
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 2.16E+01
		 Max: 8.99E+02
	Gini Index: 0.54

ItemKNNCBFCFRecommender: URM Detected 3218 (17.40 %) cold items.
Similarity column 18495 ( 100 % ), 1997.62 column/sec, elapsed time 0.15 min
In [38]:
scores_extreme = item_cbf_cf_all._compute_item_score(user_id_array=extreme_warm_users)
In [57]:
# Plot the sorted score distribution of each extreme warm user
for i in range(0, scores_extreme.shape[0]):
    plt.plot(np.sort(scores_extreme[i]), 'ro')
    plt.xlabel('Item index')
    plt.ylabel('Scores for user {}'.format(extreme_warm_users[i]))
    plt.show()
In [61]:
interactions_per_user = np.ediff1d(URM_all.tocsr().indptr)
mid_mask = (interactions_per_user > 10) & (interactions_per_user < 30)
mid_users = np.arange(URM_all.shape[0])[mid_mask]

scores_mid = item_cbf_cf_all._compute_item_score(user_id_array=mid_users)
for i in range(0, scores_mid.shape[0]):
    plt.plot(np.sort(scores_mid[i]), 'ro')
    plt.xlabel('Item index')
    plt.ylabel('Score of users {}'.format(mid_users[i]))
    plt.show()
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-61-f745e6832be5> in <module>
      9     plt.xlabel('Item index')
     10     plt.ylabel('Score of users {}'.format(mid_users[i]))
---> 11     plt.show()

[... matplotlib rendering stack frames elided ...]

KeyboardInterrupt: 
In [72]:
ICM_subclass = data_reader.get_ICM_from_name("ICM_sub_class") 
In [90]:
# Collect the sub-classes of the items liked by each extreme warm user
subclass_extreme = []
for user in extreme_warm_users:
    liked_items = URM_all[user].indices
    subclass_of_liked_items = ICM_subclass[liked_items].indices
    subclass_extreme.append(subclass_of_liked_items)
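The lists built above are never summarised in the notebook. A hedged sketch (synthetic data shaped like `subclass_extreme`) of one possible summary: the share of each user's likes that fall in their single most frequent sub-class, as a rough "taste concentration" measure:

```python
import numpy as np

# Hypothetical per-user arrays of sub-class indices, one entry per liked item
subclass_extreme = [
    np.array([3, 3, 3, 7, 3]),    # user A: 4 of 5 likes in sub-class 3
    np.array([1, 2, 3, 4, 5]),    # user B: likes spread evenly
]

concentrations = []
for subs in subclass_extreme:
    _, counts = np.unique(subs, return_counts=True)
    concentrations.append(counts.max() / subs.size)   # share of the top sub-class
print([round(c, 2) for c in concentrations])   # -> [0.8, 0.2]
```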

Gray Sheep Users - Analysis and detection

From "Identifying Grey Sheep Users By The Distribution of User Similarities In Collaborative Filtering" - Zheng et al.

The method is divided into 4 steps:

  • Representing the distribution of similarities for each user, summarised via quantile and skewness statistics
  • Example selection
  • Outlier detection
  • Examinations
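Before step 1, it may help to recall what skewness captures here: a user whose similarity values have a long right tail (most neighbours barely similar, a few very similar) gets a large positive skew, while a roughly symmetric distribution skews near zero. A minimal illustration with synthetic similarity values (distributions chosen for illustration only):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Long right tail: most similarities near zero, a few large ones
long_tail = rng.exponential(scale=0.05, size=1000)
# Roughly symmetric similarities around their mean
symmetric = rng.normal(loc=0.5, scale=0.1, size=1000)

print(skew(long_tail) > skew(symmetric))   # -> True
```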
In [96]:
# STEP 1: REPRESENTING DISTRIBUTION OF SIMILARITIES, FOR EACH USER
user_cf = best_models.UserCF.get_model(URM_train=URM_all, load_model=False)
W_sparse = user_cf.W_sparse

sim_q1 = np.zeros(URM_all.shape[0])
sim_q2 = np.zeros(URM_all.shape[0])
sim_q3 = np.zeros(URM_all.shape[0])
sim_mean = np.zeros(URM_all.shape[0])
sim_std = np.zeros(URM_all.shape[0])
sim_skew = np.zeros(URM_all.shape[0])

for i in range(URM_all.shape[0]):
    if i % 10000 == 0:
        print("Done {} over {}".format(i, URM_all.shape[0]))
    if W_sparse[i].data.size > 0:
        user_similarity = W_sparse[i].data
        quantiles = np.quantile(a=user_similarity, q=[0.25, 0.5, 0.75])
        sim_q1[i] = quantiles[0]
        sim_q2[i] = quantiles[1]
        sim_q3[i] = quantiles[2]
        sim_mean[i] = user_similarity.mean()
        sim_std[i] = user_similarity.std()
        sim_skew[i] = skew(user_similarity)
UserKNNCFRecommender: URM Detected 3218 (17.40 %) cold items.
Similarity column 27255 ( 100 % ), 1765.47 column/sec, elapsed time 0.26 min
Done 0 over 27255
Done 10000 over 27255
Done 20000 over 27255
In [196]:
# STEP 2: EXAMPLE SELECTION (here I chose to consider only skewness; otherwise too few users were selected)

# Bad examples selection
empirical_threshold_skew = np.quantile(a=sim_skew, q=0.75)
bad_examples_users_skew = np.argwhere((sim_skew > empirical_threshold_skew) & (sim_mean > 0)).squeeze()
bad_examples = bad_examples_users_skew

include_other_info = False
if include_other_info:
    empt_t_mean = np.quantile(a=sim_mean, q=0.25)
    bad_examples_users_mean = np.argwhere((sim_mean < empt_t_mean) & (sim_mean > 0)).squeeze()
    bad_examples = bad_examples_users_mean

empirical_threshold_skew = np.quantile(a=sim_skew, q=0.25)
good_examples_users_skew = np.argwhere((sim_skew < empirical_threshold_skew) & (sim_mean > 0)).squeeze()
good_examples = good_examples_users_skew

# Debug Content
almost_cold_users_mask = np.ediff1d(URM_all.tocsr().indptr) <= 2
almost_cold_users = np.arange(URM_all.shape[0])[almost_cold_users_mask]

very_warm_users_mask = np.ediff1d(URM_all.tocsr().indptr) > 40
very_warm_users = np.arange(URM_all.shape[0])[very_warm_users_mask]

print("There are {} users with less than 3 interactions in the bad examples, which are {}".format(np.in1d(bad_examples, almost_cold_users).sum(), bad_examples.size))
print("There are {} users with less than 3 interactions in the good examples, which are {}".format(np.in1d(good_examples, almost_cold_users).sum(), good_examples.size))
print("There are {} users with more than 40 interactions in the bad examples, which are {}".format(np.in1d(bad_examples, very_warm_users).sum(), bad_examples.size))
print("There are {} users with more than 40 interactions in the good examples, which are {}".format(np.in1d(good_examples, very_warm_users).sum(), good_examples.size))
There are 1427 users with less than 3 interactions in the bad examples, which are 6814
There are 1329 users with less than 3 interactions in the good examples, which are 6789
There are 0 users with more than 40 interactions in the bad examples, which are 6814
There are 1931 users with more than 40 interactions in the good examples, which are 6789
In [192]:
temp = (URM_all[bad_examples] > 0).sum(axis=1)
temp = np.array(temp).squeeze()
temp = np.sort(temp)
plt.plot(temp, 'ro')
plt.xlabel('User index')
plt.ylabel('Number of interactions')
plt.show()
In [104]:
sim_skew[extreme_warm_users]
Out[104]:
array([1.01315224, 0.87250429, 0.61851889, 0.57845128, 0.97854471,
       0.53023887, 0.68597174, 0.71689945, 0.6706841 , 0.60098296,
       0.63144863, 0.75916344, 0.77098459, 0.81081313, 0.54241592,
       0.78448927, 0.75647283, 0.65946883, 0.78736913, 0.82845569,
       0.74197155, 0.95049238, 0.63584226, 0.62486207, 0.99186498,
       0.62236267, 0.6934942 , 0.72356975, 0.74211133, 1.00631869,
       0.58245194, 0.6163258 , 0.86409789, 0.90930927, 0.82681751,
       0.68686146, 0.68909252, 0.70121461, 0.46909726, 0.6643036 ,
       0.90188956, 0.74024862, 0.65459281, 0.72987872, 0.70045954,
       0.51678181, 0.7442261 , 0.70406878, 0.92941684, 0.64516145,
       0.95764107, 0.93157417, 0.89828014, 0.65284777, 0.64365715])
In [102]:
users_to_keep = np.ediff1d(URM_all.tocsr().indptr) > 15
users_to_keep = np.arange(URM_all.shape[0])[users_to_keep]
sim_skew[users_to_keep].mean()
Out[102]:
1.4798125697512232
In [106]:
sim_mean[extreme_warm_users]
Out[106]:
array([0.04106272, 0.04420094, 0.05082949, 0.05315977, 0.04463761,
       0.05331155, 0.05116257, 0.05280016, 0.05036769, 0.05127606,
       0.05139428, 0.04584904, 0.05026084, 0.05174049, 0.05903436,
       0.04736799, 0.04982213, 0.05801579, 0.04615551, 0.04876825,
       0.04667097, 0.04871478, 0.05679297, 0.05474585, 0.0472174 ,
       0.06799701, 0.05348776, 0.04856725, 0.05028695, 0.04212175,
       0.05782602, 0.06389102, 0.04835289, 0.04999664, 0.05201786,
       0.05555704, 0.05137462, 0.05109761, 0.05249437, 0.05308927,
       0.04416302, 0.0473983 , 0.05711786, 0.04876477, 0.05136738,
       0.05500605, 0.04800225, 0.05771661, 0.04493921, 0.05565838,
       0.04141199, 0.04124196, 0.04408622, 0.04782335, 0.05164096])
In [107]:
sim_mean.mean()
Out[107]:
0.12248899073578681

As we can see, these users have a very low mean similarity to other users. Their similarity distributions, however, do not appear particularly skewed, at least compared to those of other users. What happens if we take them away while training? Do the scores improve?
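Removing them while training can be sketched by blanking their URM rows, keeping the matrix shape so user indices stay aligned with the full dataset. A minimal toy example (hypothetical matrix and user set; the real notebook would use `URM_all` and `extreme_warm_users`):

```python
import numpy as np
import scipy.sparse as sps

# Hypothetical tiny URM (4 users x 4 items)
URM = sps.csr_matrix(np.array([
    [1, 1, 1, 1],   # "extreme" user 0
    [1, 0, 0, 0],
    [1, 1, 1, 1],   # "extreme" user 2
    [0, 1, 0, 0],
]))
extreme_users = np.array([0, 2])

# Zero out the extreme users' rows (LIL format allows cheap row assignment)
URM_train = URM.tolil()
for u in extreme_users:
    URM_train[u, :] = 0
URM_train = URM_train.tocsr()
URM_train.eliminate_zeros()   # drop any explicit zeros left behind
print(URM_train.nnz)   # -> 2
```

The model would then be fit on `URM_train` while evaluation still uses the original indices.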

In [111]:
sim_mean.std()
Out[111]:
0.04803820571530061

However, they do not seem to be real outliers under the (mean - 3*std) rule. Note that a known problem of this method is that it can fail: "This method can fail to detect outliers because the outliers increase the standard deviation. The more extreme the outlier, the more the standard deviation is affected."
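A toy illustration of exactly this failure mode, together with a robust median/MAD alternative (the 1.4826 factor makes the MAD comparable to a standard deviation under normality; data values are made up for the demonstration):

```python
import numpy as np

# Toy similarity means: a tight cluster plus one extreme outlier
sim_mean = np.array([0.10, 0.11, 0.12, 0.10, 0.11, 0.12, 0.11, 0.90])

# Classic rule |x - mean| > 3 * std: the outlier inflates both the mean
# and the std, so it hides itself
classic = np.abs(sim_mean - sim_mean.mean()) > 3 * sim_mean.std()
print(classic.sum())   # -> 0: the outlier is missed

# Median/MAD rule: the outlier cannot inflate the median or the MAD
med = np.median(sim_mean)
mad = np.median(np.abs(sim_mean - med))
robust = np.abs(sim_mean - med) > 3 * 1.4826 * mad
print(robust.sum())    # -> 1: the outlier is caught
```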

In [ ]: